TO: Ed Wickham


DATE: February 26, 1968 


FROM: John E. Tindall 

SUBJECT: Outliers 

The table below provides the information necessary for establishing rules by 
which an outlier may be eliminated from a sample. As you will discover by 
examining the table, the rule of thumb of rejecting observations which fall at 
least three standard deviations from the sample mean is fairly acceptable for 
a specific set of circumstances, which happens to coincide with the conditions 
under which Cigaret Intelligence data is usually evaluated. 

Suppose a sample of N observations x_0, x_1, x_2, ..., x_{N-1} has been obtained and 
one of the observations, x_0, is so far from the others that it appears to have 
resulted from something other than the usual random error. Eliminating x_0 from 
the sample is equivalent to rejecting the hypothesis that x_0 is an observation 
from the same population as x_1, x_2, ..., x_{N-1}. A test of this hypothesis can be 
formulated as a one-sided or two-sided test and can be performed at any desired 
level of significance. 

As you know, the decision to reject x_0 from the sample can be based on the number 
of standard deviations between the sample mean and x_0. This number is a function 
of the sample size, N; the level of significance, α; and whether the test is 
one-sided or two-sided. 
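
As an illustration only, the number of standard deviations between x_0 and the mean of the remaining observations might be computed as sketched below. The function name `sd_distance` and the readings are hypothetical, and the n-1 divisor used by `statistics.stdev` is an assumption, since the choice of divisor is not stated here.

```python
import statistics

def sd_distance(observations, suspect_index=0):
    """Signed distance, in standard deviations, of the suspect observation
    from the mean of the other observations (suspect excluded)."""
    suspect = observations[suspect_index]
    others = [x for i, x in enumerate(observations) if i != suspect_index]
    mean_others = statistics.mean(others)
    # Assumption: sample standard deviation with the n-1 divisor.
    sd_others = statistics.stdev(others)
    return (suspect - mean_others) / sd_others

# Hypothetical readings; the first value is the suspect x_0.
readings = [14.0, 20.0, 21.0, 19.0, 20.0, 22.0, 18.0, 20.0, 21.0, 19.0]
distance = sd_distance(readings)
print(round(distance, 2))  # about -4.90
```

The sign is kept so that a one-sided test can tell a low suspect from a high one.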

A one-sided test is one in which only values which are too low (or too high) are 
of concern, while a two-sided test is one in which observations which are 
either too low or too high may be encountered and are of concern. 

In the strictest sense, in the type of work you do, two-sided tests should be used. 
Practically, however, the one-sided test may be adequate for many applications. 

In particular, TPM measurements, for which an erroneous low value is quite likely 
but an erroneous high value is not, could be tested for outliers with a 
one-sided test. 
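
The difference between the two kinds of test can be sketched as a decision rule on the signed distance (in standard deviations) of the suspect observation from the mean. This is a minimal sketch, not a prescribed procedure: the function name `reject`, its parameters, and the use of "at least" (>=) for the comparison are assumptions made for illustration.

```python
def reject(distance, critical, sides="two", direction="low"):
    """Decide whether a suspect observation is rejected.

    distance  -- signed distance from the mean, in standard deviations
    critical  -- required number of standard deviations (a positive number)
    sides     -- "two" for a two-sided test, "one" for a one-sided test
    direction -- for a one-sided test, which tail is of concern
    """
    if sides == "two":
        # Both too-low and too-high values are of concern.
        return abs(distance) >= critical
    if direction == "low":
        # Only too-low values are of concern.
        return distance <= -critical
    # Only too-high values are of concern.
    return distance >= critical

# A suspect 3.5 SD below the mean, judged against a hypothetical critical value.
print(reject(-3.5, 3.4))                # two-sided: True
print(reject(3.5, 3.2, "one", "low"))   # one-sided low, but suspect is high: False
```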

The values in the body of the table are the number of standard deviations that the 
observation x_0 must lie from the mean of the sample in order for x_0 to be rejected 
from the sample. (The standard deviation and mean referred to above must be based on 
the observations x_1, x_2, ..., x_{N-1}, exclusive of x_0.) Numbers are given in the table 
for values of α and N which I assume will meet your needs. Only two significant 
figures are given because interpolation was necessary in the tables I used. In all 
instances, the numbers in the table will lead to slightly conservative tests. 

Table I: Distance from Mean* Necessary to Eliminate an Observation 

               α = .05                   α = .01
   N     One-Sided   Two-Sided     One-Sided   Two-Sided

   10    3.2 SD**    3.4 SD**      4.0 SD**    4.7 SD**
   20    3.1         3.3           3.8         4.6
   30    3.1         3.4           3.8         4.4
   40    3.1         3.5           3.9         4.3
   80    3.3         3.6           3.9         4.2
  160    3.4         3.7
  320    3.6         3.8
  640    3.8         4.0

* Mean based on x_1, x_2, ..., x_{N-1} 

** Standard deviation based on x_1, x_2, ..., x_{N-1} 
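
For anyone applying Table I by machine, a lookup might be sketched as follows. The names `TABLE_I`, `COLUMNS`, and `critical_distance` are hypothetical. The values are transcribed from the table as it appears, blank entries are recorded as `None`, and no interpolation between sample sizes is attempted, since none is described here.

```python
# Rows keyed by sample size N. Each row holds, in order:
# (α=.05 one-sided, α=.05 two-sided, α=.01 one-sided, α=.01 two-sided).
# None marks an entry that the table leaves blank.
TABLE_I = {
    10: (3.2, 3.4, 4.0, 4.7),
    20: (3.1, 3.3, 3.8, 4.6),
    30: (3.1, 3.4, 3.8, 4.4),
    40: (3.1, 3.5, 3.9, 4.3),
    80: (3.3, 3.6, 3.9, 4.2),
    160: (3.4, 3.7, None, None),
    320: (3.6, 3.8, None, None),
    640: (3.8, 4.0, None, None),
}

COLUMNS = {(0.05, "one"): 0, (0.05, "two"): 1, (0.01, "one"): 2, (0.01, "two"): 3}

def critical_distance(n, alpha=0.05, sides="two"):
    """Return the tabled number of standard deviations for sample size n."""
    if n not in TABLE_I:
        raise KeyError(f"N = {n} is not tabulated; no interpolation is attempted")
    value = TABLE_I[n][COLUMNS[(alpha, sides)]]
    if value is None:
        raise ValueError(f"no tabled value for N = {n}, alpha = {alpha}, {sides}-sided")
    return value

print(critical_distance(10, 0.05, "two"))  # 3.4
print(critical_distance(40, 0.01, "one"))  # 3.9
```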

Using all of the observations x_0, x_1, ..., x_{N-1} to calculate the mean and 
standard deviation which are used to test x_0 will lead to a slightly conservative 
test, which will reject x_0 no more often than the test using the mean and standard 
deviation based on x_1, x_2, ..., x_{N-1}. 
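
The effect of including the suspect observation in the mean and standard deviation can be seen in a small numerical sketch. The readings and the helper name `distance_in_sds` are hypothetical, and the n-1 divisor of `statistics.stdev` is an assumption.

```python
import statistics

def distance_in_sds(suspect, reference):
    """Signed distance of suspect from the mean of reference, in SDs of reference."""
    # Assumption: sample standard deviation with the n-1 divisor.
    return (suspect - statistics.mean(reference)) / statistics.stdev(reference)

# Hypothetical readings; the first value is the suspect x_0.
readings = [14.0, 20.0, 21.0, 19.0, 20.0, 22.0, 18.0, 20.0, 21.0, 19.0]
suspect, others = readings[0], readings[1:]

exclusive = distance_in_sds(suspect, others)    # suspect left out of mean and SD
inclusive = distance_in_sds(suspect, readings)  # suspect included in mean and SD

print(round(exclusive, 2))  # about -4.90
print(round(inclusive, 2))  # about -2.43
```

Including the suspect pulls the mean toward it and inflates the standard deviation, so the inclusive distance is the smaller of the two in magnitude. With the n-1 divisor assumed here, the inclusive distance cannot exceed (N - 1)/sqrt(N), about 2.85 for N = 10, so for small samples the inclusive form may never reach the tabled values.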

The test for an outlier should, of course, only be used when no cause can be 
found for the suspect observation and when it is unreasonable to assume that the 
observation resulted from variation inherent in the population being measured. 


/jc 


cc: C. E. Badgett 
G. R. Berman 
P. A. Eichorn 